Data Analyst Job Postings SQL Analysis¶

by Daria Yuferova

Project Description¶

This project is based on Data Analyst Job Postings [Pay, Skills, Benefits] dataset by Luke Barousse and his analysis (see license).

In his data collection process, Luke uses SerpAPI to pull job postings from Google's search results for Data Analyst positions in the United States. Data collection started on November 4th, 2022, and the results are updated daily - check it out, as well as his wonderful YouTube channel! Thank you, Luke, for everything you're doing for the Data Analytics community! 🙌

When examining Luke's results, I was surprised to see SQL (not Python!) at the very top of the skills for Data Analysts. So it only made sense for me to try and process this data using the employers' favorite tool!

The purpose of this project is mainly to practice my SQL skills, so this notebook will only cover the analysis done in that language. For the prerequisite Python code, you can refer to my GitHub repository. Again, the code is mostly Luke's intellectual property; I just tweaked a couple of things and exported a CSV file with a few columns I needed.

STEP 1. Uploading Data to SQL Database and Preparing Data¶

Prepare all the necessary packages to work with SQL (and some Python) in Jupyter Notebook:

In [1]:
# pip install mysqlclient # to run SQL queries inside Jupyter Notebook with %%sql magic ✨
In [2]:
# %pip install jupysql --upgrade  # to transform SQL query results into pandas dataframes
In [3]:
%load_ext sql
In [4]:
# %sql  mysql:// # you need to enter your credentials here to connect to your SQL database
In [6]:
import MySQLdb
import pandas as pd
import plotly.express as px

Since we have job postings data in CSV file format (exported from the Python script), we need to import the data into an SQL table to start analyzing it with SQL.
First we create a template for our future table specifying expected columns and datatypes:

In [7]:
%%sql

CREATE TABLE IF NOT EXISTS jobs (
    title VARCHAR(255),
    company_name VARCHAR(255),
    location VARCHAR(255),
    via VARCHAR(255),
    date_time DATETIME,
    salary_standardized VARCHAR(255),
    description_tokens TEXT
);
Running query in 'mysql://root:***@localhost/postings'
Out[7]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Then we load the CSV file into a new table:

In [8]:
%%sql

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/jobs_sql.csv'
INTO TABLE jobs
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;
Running query in 'mysql://root:***@localhost/postings'
23745 rows affected.
Out[8]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

We will need to transform the table for our needs: sort it by the time of job posting, add an ID column, and handle the salary and description tokens columns.
First sort the table chronologically:

In [9]:
%%sql

ALTER TABLE jobs
ORDER BY date_time ASC;
Running query in 'mysql://root:***@localhost/postings'
23745 rows affected.
Out[9]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Add an ID column based on chronological order:

In [10]:
%%sql

ALTER TABLE jobs
ADD ID INT AUTO_INCREMENT PRIMARY KEY FIRST;
Running query in 'mysql://root:***@localhost/postings'
Out[10]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Change the salary_standardized column data type to float, paying attention to null values (reminder: the salary_standardized column is currently stored as a string, and empty strings cannot be directly converted to float).

In [11]:
%%sql

UPDATE jobs
SET salary_standardized = null
WHERE salary_standardized = '';
Running query in 'mysql://root:***@localhost/postings'
19249 rows affected.
Out[11]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [12]:
%%sql

ALTER TABLE jobs
MODIFY salary_standardized FLOAT(53);
Running query in 'mysql://root:***@localhost/postings'
23745 rows affected.
Out[12]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
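In plain Python, the two statements above amount to something like this (a hypothetical sketch of the same cleanup, not part of the notebook pipeline):

```python
def clean_salary(raw: str):
    # Mirror the SQL cleanup above: empty strings become NULL (None)
    # first, because '' cannot be cast to FLOAT directly.
    return None if raw == '' else float(raw)
```

So `clean_salary('')` yields None, while `clean_salary('103781')` yields 103781.0.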

Let's also prepare the skills (description_tokens) column for future analysis by removing blank spaces:

In [13]:
%%sql

UPDATE jobs
SET description_tokens = REPLACE(description_tokens, ', ', ',');
Running query in 'mysql://root:***@localhost/postings'
23745 rows affected.
Out[13]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
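The REPLACE call above simply drops the space after each comma, which matters later for MySQL's FIND_IN_SET(); in Python terms:

```python
def normalize_tokens(tokens: str) -> str:
    # 'power bi, sas, r' -> 'power bi,sas,r', matching
    # REPLACE(description_tokens, ', ', ',') in the SQL above
    return tokens.replace(', ', ',')
```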

Check the results:

In [14]:
%%sql

DESCRIBE jobs;
Running query in 'mysql://root:***@localhost/postings'
8 rows affected.
Out[14]:
Field Type Null Key Default Extra
ID int NO PRI None auto_increment
title varchar(255) YES None
company_name varchar(255) YES None
location varchar(255) YES None
via varchar(255) YES None
date_time datetime YES None
salary_standardized double YES None
description_tokens text YES None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [15]:
%%sql

SELECT *
FROM jobs
LIMIT 5;
Running query in 'mysql://root:***@localhost/postings'
5 rows affected.
Out[15]:
ID title company_name location via date_time salary_standardized description_tokens
1 DATA ANALYST w/INFORMATICA EXPERIENCE Business Advisory Solutions Group (BASG) United States via Big Country Jobs 2022-11-04 03:40:11 None integrity
2 Data Analyst II EDWARD JONES Bates City, MO via My ArkLaMiss Jobs 2022-11-04 03:40:11 103781.0 power bi,sas,critical thinking,r,excel,sql,decision making,presentation skills,spss,python,snowflake
3 Data Analyst III - Full Time - Days Mercy Springfield, MO via Careers At Mercy 2022-11-04 03:40:11 None communication,go,excel,word,outlook,powerpoint
4 Sr. Business Data Analyst New American Funding United States via KSNT Jobs 2022-11-04 03:40:11 100000.0 communication,power bi,tableau,ssrs,sql,snowflake,ssis
5 Sr. Data Analyst Medasource United States via KSNT Jobs 2022-11-04 03:40:11 110000.0 communication,azure,interpersonal skills,excel,sql,integrity,problem solving
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Everything is in order!

STEP 2. Exploratory Data Analysis¶

Basic data exploration was previously done (see the Python code), but we will also explore the data with SQL.
Let's check how many job postings include salary data:

In [16]:
%%sql

SELECT COUNT(*) AS n_of_postings, COUNT(DISTINCT salary_standardized) AS salary_data,
       COUNT(DISTINCT salary_standardized) / COUNT(*) AS salary_data_share
FROM jobs ;
Running query in 'mysql://root:***@localhost/postings'
1 rows affected.
Out[16]:
n_of_postings salary_data salary_data_share
23745 542 0.0228
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Only 542 distinct salary values appear across all recorded postings. (Note that COUNT(DISTINCT salary_standardized) counts distinct values rather than postings with a salary; judging by the 19,249 rows set to NULL earlier, roughly 4,500 postings - about 19% - actually carry salary information.) Either way, that is not a lot for a set of over 20k job postings.
Check the limits of the salary data:

In [17]:
%%sql

SELECT MIN(salary_standardized), MAX(salary_standardized)
FROM jobs;
Running query in 'mysql://root:***@localhost/postings'
1 rows affected.
Out[17]:
MIN(salary_standardized) MAX(salary_standardized)
18720.0 624000.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

The main reason for such a ~wild~ wide range of values is that our dataset includes not only full-time jobs but also part-time, freelance, and project positions, where the compensation offered rarely scales correctly to an annual format.
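For instance, standardizing an hourly freelance rate to an annual figure typically assumes ~2,080 work hours per year (40 h × 52 weeks) - a sketch of that convention (the multiplier is my assumption, not necessarily what the source pipeline uses):

```python
HOURS_PER_YEAR = 40 * 52  # 2080 -- a common annualization assumption

def annualize(rate: float, period: str) -> float:
    # A $300/hour freelance gig annualizes to $624,000 -- exactly the
    # suspicious maximum we just observed.
    factors = {'hourly': HOURS_PER_YEAR, 'monthly': 12, 'yearly': 1}
    return rate * factors[period]
```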
Let's find the top 10 most popular job titles:

In [18]:
%%sql

SELECT title, COUNT(*) AS freq
FROM jobs
GROUP BY title
ORDER BY freq DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[18]:
title freq
Data Analyst 3017
Senior Data Analyst 830
Data Analyst II 385
Lead Data Analyst 304
Sr. Data Analyst, Marketing Operations 271
Business Data Analyst 251
Data Scientist 203
Data Analyst III 180
Sr. Data Analyst 130
Marketing Data Analyst 126
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
Truncated to displaylimit of 10
If you want to see more, please visit displaylimit configuration

Interesting. Our TOP 10 mostly consists of Data Analysts of different levels and domains.
But are they also the TOP 10 job titles with the best compensation?

In [19]:
%%sql

SELECT title, ROUND(AVG(salary_standardized),1) AS avg_salary
FROM jobs
GROUP BY title
ORDER BY avg_salary DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[19]:
title avg_salary
Freelance Data Analyst 624000.0
Consulting from Data buyer 572000.0
Looking for a EXPERT Equity Research Analyst - Contract to Hire 468000.0
Data Analyst for Ecommerce and Marketing Agency 322400.0
Analyze census data 312000.0
Cluster Analysis Expert 312000.0
Designing an Excel Scorecard for our Company 286000.0
Amazon PPC Data Analyst 260000.0
PLS, ensamble models, decision tree algorithms 260000.0
Market research data analyst - Contract to Hire 260000.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Again, we see unusually high salaries and unusual titles. Let's exclude the freelance postings coming via Upwork, as they tend to be recorded with poorly scalable hourly/per-project salary data:

In [20]:
%%sql

SELECT title, via, ROUND(AVG(salary_standardized),1) AS avg_salary
FROM jobs
WHERE via != 'via Upwork'
GROUP BY title, via
ORDER BY avg_salary DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[20]:
title via avg_salary
Principal Data Engineer/Architect via LinkedIn 260000.0
SAP S4 HANA BRIM Analyst via LinkedIn 236683.2
Clinical Mapping Data Analyst via LinkedIn 235040.0
Contract Data Scientist/Engineer - $125 p/h via LinkedIn 234000.0
Data Scientist, Analytics via Ai-Jobs.net 233500.0
Lead Cybersecurity Data Analyst via Indeed 232500.0
Founding Data Scientist via LinkedIn 225000.0
Director Data Scientist - Commercial Platforms (P2474). via Ai-Jobs.net 221875.0
Data Analytics & Engineering - Data Analyst V Data Analyst V via Mindlance - Talentify 220480.0
Data Engineer (Hybrid) via Ai-Jobs.net 217500.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Now we encounter more Data Science and Data Engineering positions on top!
We see a wide variety of highly individualized titles, which makes it hard to draw conclusions about which areas of Data Analytics pay the most.
Let's create wider title groups:

In [21]:
%%sql

SELECT
CASE
    WHEN title LIKE '%senior%' OR title LIKE '%lead%' OR title LIKE '%sr%' 
                               OR title LIKE '%analyst IV%' THEN 'Senior Data Analyst'
    WHEN title LIKE '%mid data analyst%' OR title LIKE '%mid%' 
                                         OR title LIKE '%analyst II%' THEN 'Middle Data Analyst'
    WHEN title LIKE '%junior data analyst%' OR title LIKE '%analyst I%' THEN 'Junior Data Analyst'
    WHEN title LIKE '%data engineer%' THEN 'Data Engineer'
    WHEN title LIKE '%data scientist%' THEN 'Data Scientist'
    WHEN title LIKE '%data analyst%' THEN 'Data Analyst'
    WHEN title LIKE '%analyst%' THEN 'Analyst'
    ELSE 'Other'
END AS title_cat, ROUND(AVG(salary_standardized),1) AS avg_salary
FROM jobs
GROUP BY title_cat
ORDER BY avg_salary DESC;
Running query in 'mysql://root:***@localhost/postings'
8 rows affected.
Out[21]:
title_cat avg_salary
Data Engineer 129481.3
Data Scientist 125937.8
Senior Data Analyst 108351.7
Analyst 101062.2
Middle Data Analyst 92079.6
Data Analyst 89800.9
Other 89785.5
Junior Data Analyst 58249.9
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Again, Data Engineers and Data Scientists are luckier with generous offers. This data would be most interesting for Junior Data Analysts entering the field, as they can see the average reference level of salaries mentioned in US job postings over the last 8 months.
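The CASE expression's first-match-wins logic can be sketched in Python like so (note that broad substring checks such as '%sr%' can over-match titles containing e.g. 'ssrs', so the grouping is only approximate):

```python
def categorize_title(title: str) -> str:
    # Ordered checks mirror the SQL CASE above: the first pattern
    # that matches wins.
    t = title.lower()
    if any(k in t for k in ('senior', 'lead', 'sr', 'analyst iv')):
        return 'Senior Data Analyst'
    if 'mid' in t or 'analyst ii' in t:  # note: also matches 'analyst iii'
        return 'Middle Data Analyst'
    if 'junior data analyst' in t or 'analyst i' in t:
        return 'Junior Data Analyst'
    if 'data engineer' in t:
        return 'Data Engineer'
    if 'data scientist' in t:
        return 'Data Scientist'
    if 'data analyst' in t:
        return 'Data Analyst'
    if 'analyst' in t:
        return 'Analyst'
    return 'Other'
```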

For further analysis it would be interesting to split the dataset by time periods and check if there is any trend in the average salary over time, provided we have the data covering a longer period.

Next up - companies: who pays more VS who offers more jobs?

In [22]:
%%sql

SELECT company_name, ROUND(AVG(salary_standardized),1) AS avg_salary, COUNT(*) AS n_of_postings
FROM jobs
GROUP BY company_name
ORDER BY n_of_postings DESC -- or ORDER BY avg_salary DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[22]:
company_name avg_salary n_of_postings
Upwork 87550.2 3864
EDWARD JONES 122218.9 727
Walmart 110500.0 703
Corporate 58240.0 606
Cox Communications 97068.9 510
Talentify.io None 414
Dice None 258
Insight Global 101845.2 214
Staffigo Technical Services, LLC None 167
Centene Corporation None 117
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Most job offers come from the freelance platform Upwork or from big players like Walmart. Even among the most active companies, there are some that do not reveal any salary data in their job postings.

STEP 3. Interactive Visualizations with Plotly¶

Now that we have gotten familiar with the data, let's look into the skills mentioned most often in the job posting descriptions and the compensation offered for those skills.
We will also visualize the findings using Plotly, an interactive graphing library for Python.
Visualizing SQL queries is not a trivial task, but the JupySQL tool makes it possible by linking the results of SQL queries to a pandas (or Polars) dataframe. I find it very inspiring to work with cross-language tools that make life easier for data analysts!

Iterating through lists and counting instances with Python's list comprehensions is easy and elegant.
In SQL, however, I managed to do this by creating a skills table (a poor substitute for normalization, but alas) and using the FIND_IN_SET() function, which returns the position of a string within a comma-separated list, to join the tables (this is why we needed to remove all spaces in the description_tokens column - the function requires a strictly comma-separated list).
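FIND_IN_SET()'s behaviour is easy to emulate, which also shows why the stray spaces had to go (a quick Python model of the MySQL function):

```python
def find_in_set(needle: str, csv_list: str) -> int:
    # MySQL FIND_IN_SET(): 1-based position of needle in a
    # comma-separated list, 0 if absent. No trimming is done, so
    # ' sql' != 'sql' -- hence the space removal earlier.
    items = csv_list.split(',')
    return items.index(needle) + 1 if needle in items else 0
```

`find_in_set('sql', 'power bi,sql,python')` gives 2, while `find_in_set('sql', 'power bi, sql')` gives 0 because the untrimmed item is `' sql'`.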

First, we create a new table in our database and fill it with the skills of my choice (those I am interested in analyzing) and a skill category - hard or soft:

In [23]:
%%sql

CREATE TABLE IF NOT EXISTS skills (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    skill VARCHAR(255),
    type VARCHAR(255)
);
Running query in 'mysql://root:***@localhost/postings'
Out[23]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [24]:
%%sql

INSERT INTO skills (skill, type)
VALUES 
    ('python', 'hard skill'), ('sql', 'hard skill'), ('r', 'hard skill'), ('power bi', 'hard skill'),
    ('tableau', 'hard skill'), ('github', 'hard skill'), ('machine learning', 'hard skill'), ('aws', 'hard skill'),
    ('data lake', 'hard skill'), ('excel', 'hard skill'), ('spss', 'hard skill'), ('airflow', 'hard skill'), 
    ('jupyter', 'hard skill'), ('azure', 'hard skill'), ('bigquery', 'hard skill'),
    ('communication', 'soft skill'), ('flexibility', 'soft skill'), ('initiative', 'soft skill'),
    ('integrity', 'soft skill'), ('critical thinking', 'soft skill'), ('decision making', 'soft skill'),
    ('problem solving', 'soft skill'), ('analytical thinking', 'soft skill'), ('strategic thinking', 'soft skill'),
    ('presentation skills', 'soft skill')
Running query in 'mysql://root:***@localhost/postings'
25 rows affected.
Out[24]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

We save the SQL query that calculates the number of mentions and the average salary across those mentions per skill, and convert the result to a pandas dataframe for plotting:

In [25]:
%%sql --save skills_pay

SELECT s.skill, s.type, COUNT(*) AS n_of_postings, ROUND(AVG(j.salary_standardized),0) AS avg_salary
FROM jobs AS j 
INNER JOIN skills AS s
ON FIND_IN_SET(s.skill, j.description_tokens)
GROUP BY s.skill, s.type
ORDER BY n_of_postings DESC;
Running query in 'mysql://root:***@localhost/postings'
25 rows affected.
Out[25]:
skill type n_of_postings avg_salary
sql hard skill 12117 100755.0
communication soft skill 9832 97872.0
excel hard skill 8197 89336.0
python hard skill 6637 106027.0
tableau hard skill 6612 99766.0
power bi hard skill 6104 98281.0
r hard skill 4248 102358.0
integrity soft skill 3017 95300.0
flexibility soft skill 2246 105609.0
decision making soft skill 1984 102368.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [26]:
# save SQL query as CTE:
result = %sql SELECT * FROM skills_pay
# save CTE as dataframe:
df = result.DataFrame()
Generating CTE with stored snippets: 'skills_pay'
Running query in 'mysql://root:***@localhost/postings'
25 rows affected.

Now, let's plot the results for hard and soft skills. The graph is interactive - try hovering over the bars:

In [27]:
fig = px.bar(df, x='skill', y='n_of_postings', color='avg_salary',
             facet_row='type', width=800, height=600,
             hover_data={'avg_salary':':$,', 'n_of_postings':':,'},
             labels={'n_of_postings': 'job postings',
                     'avg_salary' : 'average salary'},
             title='Popularity and Worth of Skills in Data Analyst Job Postings')
fig.show()

We can see that the most frequently mentioned technical skill for Data Analysts is SQL (still going strong!), and it pays around $100k per year. Excel, the second most popular hard skill, is one of the least compensated on average. Python (pays well!), Tableau, and Power BI round out the TOP 5.
It's interesting to see that more advanced concepts and specific tools are associated with higher pay but are rarely mentioned (e.g., Airflow, AWS, machine learning).

Both in terms of pay and the number of mentions, we see that soft skills lag behind, which is understandable given the technical nature of Data Analytics. However, among soft skills described in job postings, communication is by far the most popular - representing the bridge between the analytical part of the job and its storytelling aspect.

Finally, let's check if there is a connection between the job location and the number of postings/average salary offered:

In [28]:
%%sql

SELECT location, ROUND(AVG(salary_standardized),1) AS avg_salary, COUNT(*) AS n_of_postings
FROM jobs
GROUP BY location
ORDER BY n_of_postings DESC -- or ORDER BY avg_salary DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[28]:
location avg_salary n_of_postings
Anywhere 94046.9 10714
United States 104499.4 5889
Kansas City, MO 83575.8 860
Oklahoma City, OK 87415.0 600
Jefferson City, MO 91201.3 507
Tulsa, OK 91960.7 348
Bentonville, AR 98365.2 292
Wichita, KS 75741.1 272
Columbia, MO 88666.2 231
Overland Park, KS 91905.6 213
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Most of the job postings don't specify a location, which might indicate a prevalence of remote work. As for the job postings with specific locations, it's hard to see any trends here.
I would expect a connection between state population and job posting activity (more people - more job opportunities), as well as between GDP by state and the average salary offered there. Let's check these hypotheses.
For that, we add population and GDP data (by US state, from worldpopulationreview.com) to the analysis. We will use GDP data for the last quarter of 2022 to see the differences between states more clearly.
Again, we first create an SQL table and upload the CSV file data:

In [29]:
%%sql

CREATE TABLE IF NOT EXISTS states_gdp (
    state VARCHAR(255),
    gdpPerCapita2022 DOUBLE,
    n VARCHAR(255),
    gdpQ42022 DOUBLE,
    pop2023 DOUBLE
    );
Running query in 'mysql://root:***@localhost/postings'
Out[29]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [30]:
%%sql

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/states_gdp.csv'
INTO TABLE states_gdp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;
Running query in 'mysql://root:***@localhost/postings'
51 rows affected.
Out[30]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

We will also need a US state name-to-abbreviation conversion table (from worldpopulationreview.com), since our jobs table uses abbreviations:

In [31]:
%%sql

CREATE TABLE IF NOT EXISTS states_abbr (
    abbr VARCHAR(255),
    state VARCHAR(255)
);
Running query in 'mysql://root:***@localhost/postings'
Out[31]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [32]:
%%sql

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/states_abbr_name.csv'
INTO TABLE states_abbr
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
Running query in 'mysql://root:***@localhost/postings'
51 rows affected.
Out[32]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

The last piece of the puzzle is geodata with state coordinates for our visualization (from Google's GitHub repository, see license):

In [33]:
%%sql

CREATE TABLE IF NOT EXISTS states_coord (
    abbr VARCHAR(255),
    lat DOUBLE,
    lon DOUBLE,
    state VARCHAR(255)
);
Running query in 'mysql://root:***@localhost/postings'
Out[33]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [34]:
%%sql

LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/states_coord.csv'
INTO TABLE states_coord
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Running query in 'mysql://root:***@localhost/postings'
52 rows affected.
Out[34]:
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Now we're ready to join all the tables on the US state name/abbreviation columns. We want one table containing: state name with coordinates, population, GDP, and the number and average salary of job postings.
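The join key on the jobs side is just the last two characters of location - the SQL RIGHT(location, 2); in Python terms:

```python
def state_abbr(location: str) -> str:
    # Equivalent of RIGHT(location, 2). 'Kansas City, MO' -> 'MO';
    # values like 'Anywhere' yield 're', which matches no state
    # abbreviation and therefore drops out of the INNER JOIN.
    return location[-2:]
```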

In [35]:
%%sql --save states

SELECT g.state AS state, c.lat, c.lon, ROUND(AVG(g.pop2023),0) AS population,
        g.gdpQ42022*1000000 AS gdp,
        ROUND(AVG(j.salary_standardized),1) AS avg_salary, COUNT(*) AS n_of_postings
FROM jobs AS j
INNER JOIN states_abbr AS a
ON RIGHT(j.location, 2) = a.abbr
INNER JOIN states_gdp AS g
ON a.state = g.state
INNER JOIN states_coord AS c
ON g.state = c.state
GROUP BY 1,2,3,5;
Running query in 'mysql://root:***@localhost/postings'
14 rows affected.
Out[35]:
state lat lon population gdp avg_salary n_of_postings
Missouri 37.964253 -91.831833 6186091.0 358840000000.0 105743.9 2773
Oklahoma 35.007752 -97.092877 4048375.0 226304000000.0 94098.2 1508
Arkansas 35.20105 -91.831833 3063152.0 162459000000.0 96833.8 1107
Rhode Island 41.580095 -71.477429 1090483.0 72935000000.0 115880.4 91
Kansas 39.011902 -98.484246 2936378.0 180925000000.0 91736.9 1390
California 36.778261 -119.417932 38915693.0 3104993000000.0 112115.6 114
Massachusetts 42.407211 -71.382437 6974258.0 612246000000.0 42820.0 40
Texas 31.968599 -99.901813 30500280.0 1919867000000.0 98378.0 37
District of Columbia 38.905985 -77.033418 674815.0 66093000000.0 None 2
Nebraska 41.492537 -99.901813 1972292.0 127422000000.0 35796.8 1
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
In [36]:
# save SQL query as CTE:
result1 = %sql SELECT * FROM states
# save CTE as dataframe:
df1 = result1.DataFrame()
Generating CTE with stored snippets: 'states'
Running query in 'mysql://root:***@localhost/postings'
14 rows affected.

The JupySQL message shows that only 14 US states are mentioned in our dataset (one of which is the District of Columbia). The other postings have either a "United States" or "Anywhere" search location, as we saw earlier.
Let's put our data on a US map to examine whether there is a connection between state population and Data Analyst job posting activity there (again - the map is interactive!):

In [37]:
fig = px.scatter_mapbox(df1, lat="lat", lon="lon", hover_name='state',
                        hover_data={'lat':False, 'lon':False, 'n_of_postings':False,
                                    'gdp':':$,.0f', 'population':':,', 'avg_salary':':$,'},
                        zoom=3, height=300, size='gdp', color='avg_salary',
                       labels = {'avg_salary' : 'average salary', 'gdp' : 'GDP'},
                       title = 'GDP by US State VS Average Salary in Data Analyst Job Postings')
fig.update_layout(mapbox_style="open-street-map", mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

The size of data points represents GDP by state in Q4 2022, color gradient - average salary, and state population can be seen when hovering over the map.

In [38]:
fig = px.scatter_mapbox(df1, lat="lat", lon="lon", hover_name='state',
                        hover_data={'lat':False, 'lon':False, 
                                    'population':':,', 'n_of_postings':':,', 'avg_salary':':$,'},
                        zoom=3, height=300, size='population', color='n_of_postings',
                       labels = {'avg_salary' : 'average salary', 'n_of_postings' : 'job postings'},
                       title = 'US States Population VS Number of Data Analyst Job Postings')
fig.update_layout(mapbox_style="open-street-map", mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()

The size of data points represents state population, color gradient - the number of job postings, and average salary can be seen when hovering over the map.

The two maps show that, despite having a small population, Missouri leads the way in Data Analyst job openings while maintaining high offered salaries. The neighbouring states of Arkansas, Oklahoma, and Kansas have also been actively hiring data specialists. Big East Coast players such as New York, New Jersey, and Pennsylvania tend to be very selective in their job postings and do not disclose salary data.
The examples of California, Texas, and New York, with their high populations but only a few job postings, show that my hypothesis is not confirmed. Clearly, any significant conclusions would require more data, but it is very interesting (and might be beneficial for analysts looking for a job) to see how local companies behave when hiring.
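One way to quantify the (lack of) relationship is a correlation coefficient over the 14 states - a sketch with a hand-rolled Pearson r (not part of the original analysis), which could be called as pearson(df1['population'], df1['n_of_postings']):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient: +1 = perfect positive linear
    # relationship, 0 = none, -1 = perfect negative.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```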

For now, let's investigate why Missouri is such an active Data Analyst "employer", both in the number of job postings and in the average salary offered:

In [39]:
%%sql

SELECT company_name, COUNT(*) AS n_of_postings, ROUND(AVG(salary_standardized), 1) AS avg_salary
FROM jobs
WHERE RIGHT(location, 2) = 'MO'
GROUP BY company_name
ORDER BY n_of_postings DESC
LIMIT 10;
Running query in 'mysql://root:***@localhost/postings'
10 rows affected.
Out[39]:
company_name n_of_postings avg_salary
EDWARD JONES 701 123017.9
Corporate 533 58240.0
Walmart 91 None
State of Missouri 75 57543.2
Commercial Solutions 53 None
Radancy 50 None
COMMERCE BANK 50 90000.0
Centene Corporation 47 None
Government Employees Health Association, Inc. - GEHA 37 None
Cynet Systems 35 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

It looks like EDWARD JONES, a financial services firm headquartered in St. Louis, Missouri[1], is actively looking for Data Analysts for its many branch offices across the state. It is also nice to see a number of public-sector positions open for data specialists.

Conclusion¶

In this project I got a chance to use my MySQL, JupySQL, Python, and Plotly skills for a simple but practical analysis. We explored the scope of actual Data Analyst job postings in the United States over the last months, with rankings, job title and skills analysis, salary data, and geodata.
Coming from a background in Compensation and Benefits (HR), I find it fascinating to see job postings providing real data-driven insights for those who, like myself, are interested in this career.

The End. And, again, check out Luke's channel for more inspiration! :)